Skills Network Logo

Exploratory Data Analysis

Estimated time needed: 30 minutes

Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.

In this notebook, we will learn some interesting and useful data exploration techniques that can be applied to explore any geographical data.

Objectives

After completing this lab you will be able to:


Setup

For this lab, we will be using the following libraries:

Installing Required Libraries

The following required modules are pre-installed in the Skills Network Labs environment. However, if you run this notebook commands in a different Jupyter environment (e.g. Watson Studio or Ananconda) you will need to install these libraries by removing the # sign before !mamba in the code cell below.

Reading and understanding our data

The dataset in this lab is Monthly average retail prices for gasoline and fuel oil, by geography . It is available through Statistics Canada and includes monthly average gasoline price (Cents per Litre), of major Canadian Cities, starting from 1979 until recent.

Another dataset, canada_provinces.geojson, contains the mapping information of all Canadian Provinces. It will be used in our analysis to produce a choropleth map.

Let's read the data into pandas dataframe and look at the first 5 rows using the head() method.

Let's find out how many entries there are in our dataset, using shape function.

Using info function, we will take a look at our types of data.

Using columns method, we will print all the column names.

Below, we will check for any missing values.

Data Wrangling

Selecting and renaming the columns of interest

Below, we are filtering our data, by selecting only the relevant columns. Also, we are using the rename() method to change the name of the columns.

Splitting the columns

The str.split() function splits the string records, by a 'comma', with n=1 slplit, and Expend=True , returns a dataframe. Below, we are splitting 'GEO' into 'City' and 'Province'.

Changing to datetime format

If we scroll up to our gasoline.info() section, we can find that 'REF_DATE' is an object type. To be able to filter by day, month, or year, we need to change the format from object type to datetime. Pandas function to_datetime() transforms to date time format. Also, we need to specify the format of datetime that we need. In our case, format='%b-%y' means that it will split into the name of a month and year. str.slice(stop=3) splits and outputs the first 3 letters of a month. For more information on how to transform to datetime, please visit this pandas documentation. Also, this web page contains more information on datetime formats.

The describe() function provides statistical information about the numeric variables. Since we only have the 'VALUE' variable that we want statistical information on, we will filter it by data.VALUE.describe() function.

Now, it is useful to know what is inside our categorical variables. We will use unique().tolist() functions to print out all of our 'GEO' colunm.

Exercise 1

In this exercise, print out all categories in 'TYPE' column.

Solution (Click Here)     data.TYPE.unique().tolist()

Data Filtering

This section will introduce you to some of the most common filtering techniques when working with pandas dataframes.

Filtering with logical operators

We can use the logical operators on column values to filter rows. First, we specify the name of our data, then, square brackets to select the name of the column, double 'equal' sign, '==' to select the name of a row group, in single or double quotation marks. If we want to exclude some entries (e.g. some locations), we would use the 'equal' and 'exclamation point' signs together, '=!'. We can also use '\</>', '<=/>=' signs to select numeric information.

Let's select the Calgary, Alberta data to see all the information.

Now, let's select 2000 year.

Filtering by multiple conditions

There are many alternative ways to perform filtering in pandas. We can also use '|' ('or') and '&' (and) to select multiple columns and rows.

For example, let us select Toronto and Edmonton locations.

Alternatively, we can use isin method to select multiple locations.

Exercise 2 a

In this exercise, please use the examples shown above, to select the data that shows the price of the 'household heating fuel', in Vancouver, in 1990.

Solution (Click Here)     exercise2a = data[( data['Year'] == 1990) & (data['TYPE'] == "Household heating fuel") & (data['City']=='Vancouver')] exercise2a

Exercise 2 b

In this exercise, please select the data that shows the price of the 'household heating fuel', in Vancouver, in the years of 1979 and 2021.

Solution (Click Here)     exercise2b = data[( data['Year'] <= 1979) | ( data['Year'] == 2021) & (data['TYPE'] == "Household heating fuel") & (data['City']=='Vancouver')] exercise2b
Hint (Click Here)     If we use '&' operator between the two years, it will return an empty data frame. This is because there was no data for the 'household heating fuel, in Vancouver, in 1979. Using 'or' operator is suitable because either one of two years that contains any information on 'household heating fuel' in Vancouver.

Filtering using groupby() method

The role of groupby() is to analyze data by some categories. The simplest call is by a column name. For example, let’s use the 'GEO' column and ngroups function to calculate the number of groups (cities, provinces) in 'GEO' column.

Most commonly, we use groupby() to split the data into groups,this will apply some function to each of the groups (e.g. mean, median, min, max, count), then combine the results into a data structure. For example, let's select the 'VALUE' column and calculate the mean of the gasoline prices per year. First, we specify the 'Year" column, following by the 'VALUE' column, and the mean() function.

Exercise 3 a

In the cell below, please use groupby() method to group by the maximum value of gasoline prices, for each month.

Solution (Click Here)     exercise3a = data.groupby(['Month'])['VALUE'].max()

Exercise 3 b

In the cell below, please use groupby() method to group by the median value of gasoline prices, for each year and each city.

Solution (Click Here)     exercise3b = data.groupby(['Year', 'City'])['VALUE'].median()
Hint (Click Here)     We can also reset the index of the new data output, by using `reset_index()`, and round up the output values to 2 decimal places. exercise3b = data.groupby(\['Year', 'City'])\['VALUE'].median().reset_index(name ='Value').round(2)

Visualizing the data with pandas plotly.express

The plotly.express library (usually imported as px) contains functions that can create entire figures at once. plotly.express is a built-in part of the plotly library, and makes creation of most common figures very easy. For more information on plotly.express, please refer to this documentation.

Here, we will plot the prices of gasoline in all cities during 1979 - 2021.

Here, we will plot the average monthly prices of gasoline in Toronto for the year of 2021.

Exercise 4

In the cell below, use plotly.express or other libraries, to plot the annual average gasoline price, per year, per gasoline type.

Solution (Click Here)     type_gas = data.groupby(['Year', 'TYPE'])['VALUE'].mean().reset_index(name ='Type').round(2) fig = px.line(type_gas, x='Year', y = "Type", color = "TYPE", color_discrete_sequence=px.colors.qualitative.Light24) fig.update_traces(mode='markers+lines') fig.update_layout( title="Fuel Type Price Trend", xaxis_title="Year", yaxis_title="Annual Average Price, Cents per Litre") fig.show()

We can also use the animated time frame to show the trend of gasoline prices over time.

Another way to display the distribution of average gasoline prices in Canadian Provinces is by plotting a map. We will use 2021 year to display the average gasoline price in all Canadian Provinces. First, we select the year.

Then, we group by the 'Province' and the 'mean' values of gasoline prices per each province. We also need to index each province with province id.

Here, we are linking each province by its specified 'provinceID' with another dataset, ‘canada_provinces.geojson’, containing all the mapping information for plotting our provinces.

First, we need to download the Canadian Provinces dataset from IBM cloud storage, using the requests.get() function.

Next, we will load the file as a string, using json.loads() function.

Exercise 5

In this exercise, experiment with different color scales to make the visualization easier to read. Some suggestions are provided in the "Hint" section. Simply copy the above code and replace 'px.colors.diverging.Tropic', with any other color scales. For example, the sequential color scales are appropriate for most continuous data, but in some cases it can be helpful to use a diverging or cyclical color scale. Diverging color scales are appropriate for the continuous data that has a natural midpoint. For more information on plotly colors, please visit this plotly documentation web page.

Hint (Click Here)     px.colors.diverging.Tropic px.colors.diverging.Temps px.colors.sequential.Greens px.colors.sequential.Reds

Congratulations! - You have completed the lab

Author

Svitlana Kramar

Change Log

Date (YYYY-MM-DD) Version Changed By Change Description
2022-01-18 0.1 Svitlana K. Added Introduction

Copyright © 2020 IBM Corporation. All rights reserved.